- 
                Notifications
    You must be signed in to change notification settings 
- Fork 524
Improve word tokenization for non-Latin characters #328
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
| '\u0000-\u002F\u003A-\u0040\u005B-\u0060\u007B-\u007E'; // Basic Latin | ||
| charsCannotBecomeWord += '\u00A0-\u00BF\u00D7\u00F7'; // Latin-1 Supplement | ||
| charsCannotBecomeWord += '\u02B9-\u02DD\u02E5-\u02FF'; // Spacing Modifier Letters | ||
| charsCannotBecomeWord += '\u0300-\u036F'; // Combining Diacritical Marks | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... this can't be right. Treating combining diacritics as word breaks is definitely wrong for European languages that I am familiar with (and I would've thought that it was wrong universally), since it leads to outcomes like this:
> a = 'festou\u0300'
'festoù'
> b = 'impube\u0300re'
'impubère'
> diff.diffWords(a, b)
[
  { count: 1, added: undefined, removed: true, value: 'festou' },
  { count: 1, added: true, removed: undefined, value: 'impube' },
  { count: 1, value: '̀' },
  { count: 1, added: true, removed: undefined, value: 're' }
]
That diff ought to just show the one word in the old text being deleted and a totally new word from the new text being added, but instead it shows us preserving the code point for the grave accent in the middle of the word and adding and deleting stuff around the accent. That's nonsense from the perspective of French or any other language I know with accents; the è (an e with a grave accent) in impubère is simply a letter in the word, just like the b and r next to it, and doesn't semantically represent a word break.
Do combining diacritics work differently in Korean (... if they even exist in Korean?) or some other language you're familiar with such that it makes sense to treat them as word breaks? I am afraid I don't speak any non-European languages and am a bit out of my depth! I can imagine in theory a language where accents get used as punctuation, but I have never encountered such a thing...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hadn't considered such cases due to my limited experience with other languages. It seems like an aspect that requires further thought.
| const tokens = []; | ||
| let prevCharType = ''; | ||
| for (let i = 0; i < value.length; i++) { | ||
| const char = value[i]; | ||
| if (spaceRegExp.test(char)) { | ||
| if(prevCharType === 'space') { | ||
| tokens[tokens.length - 1] += ' '; | ||
| } else { | ||
| tokens.push(' '); | ||
| } | ||
| prevCharType = 'space'; | ||
| } else if (cannotBecomeWordRegExp.test(char)) { | ||
| tokens.push(char); | ||
| prevCharType = ''; | ||
| } else { | ||
| if(prevCharType === 'word') { | ||
| tokens[tokens.length - 1] += char; | ||
| } else { | ||
| tokens.push(char); | ||
| } | ||
| prevCharType = 'word'; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need logic changes here? It's not obvious to me why the fix for Korean text should involve anything more than just treating Korean letters as letters instead of word boundaries...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like this is also an excessive change. My intention was to separate characters that should exist as single characters into individual tokens. (For example: U+0021, U+0022, ...)
| const char = value[i]; | ||
| if (spaceRegExp.test(char)) { | ||
| if(prevCharType === 'space') { | ||
| tokens[tokens.length - 1] += ' '; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the edge case where there's a long run of spaces in the text somewhere, this is going to take O(n^2) time where n is the number of consecutive spaces.
It also rewrites all space characters of any kind to an ordinary ' '.
Both these aspects of the behaviour here seem wrong!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. It seems like an excessive change.
| If I understand right, the problem is that right now we treat all CJK characters as if they were punctuation marks / word breaks, and the fix here treats them as letters instead. But: 
 I'll come back to this in due course. Would love to get your input in the meantime, @jihunleekr, but I understand if in the 2 years since you opened this PR you've lost interest! | 
Diff.diffWordsis not working on non-Latin characters like Korean.